This code chunk loads all of the packages that we will use in this script, using library(packagename)
Packages give us functions for working with our data, called as function(data)
If you get the error "there is no package called 'packagename'", the package is not installed yet: run install.packages("packagename") once, then run library(packagename) again
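The chunk that loads the packages is not shown in this text. A minimal sketch of what it might look like follows. The package list is an assumption, inferred from the functions used later in the script (read_csv, here, clean_names, ggplot, ggcorrplot, ggplotly), and the install check is an optional extra rather than part of the original.

```r
# Assumed package list, inferred from the functions used later in this script.
required_packages <- c("tidyverse", "here", "janitor", "ggcorrplot", "plotly")

# Check which packages are not installed yet, without attaching them.
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

# Install only what is missing (needs an internet connection).
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

library(tidyverse)   # read_csv(), %>%, select(), ggplot()
library(here)        # here()
library(janitor)     # clean_names()
library(ggcorrplot)  # ggcorrplot()
library(plotly)      # ggplotly()
```

Installing is a one-time step per machine, while the library() calls need to run in every new R session.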
Here is where we read in our data
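We use read_csv from the tidyverse (it comes from the readr package) because the file (dataset) we want to read in is a CSV (comma-separated values) file
We use here() from the here package to build the file path relative to the root of our R project, which saves us from typing long absolute file paths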
penguins_data <- read_csv(here("data/penguins_data/penguins_lter.csv"))
## Rows: 344 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
## dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
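The message above suggests two ways to quiet it. The sketch below tries both on a tiny made-up CSV written to a temporary file, so it does not touch the real penguins file; the file contents and the object names toy_quiet and toy_typed are illustrative assumptions.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write_lines(c("Species,Body Mass (g)",
              "Adelie,3750",
              "Gentoo,5000"),
            csv_path)

# Option 1: keep guessing the column types, but silence the message.
toy_quiet <- read_csv(csv_path, show_col_types = FALSE)

# Option 2: state the column types explicitly; this also silences the message.
toy_typed <- read_csv(
  csv_path,
  col_types = cols(
    Species = col_character(),
    `Body Mass (g)` = col_double()
  )
)

# Retrieve the full column specification that was used.
spec(toy_typed)
```

Stating the types explicitly has the added benefit that a change in the file (for example, text appearing in a numeric column) is reported as a parsing problem instead of silently changing the column type.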
Clean column names of the dataframe using the clean_names function from the janitor package
penguins_data <- penguins_data %>%
  clean_names()
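To see what clean_names does, here is a small sketch on a made-up one-row table that borrows two of the raw column names from the file. The table toy and its values are illustrative only.

```r
library(tibble)
library(janitor)

# Made-up one-row table using two of the raw column names from the file.
toy <- tibble(`Culmen Length (mm)` = 39.1, `Body Mass (g)` = 3750)

# Names before cleaning: spaces, capitals and parentheses.
names(toy)

# Names after cleaning: lowercase snake_case, safe to type without backticks.
names(clean_names(toy))
```

The cleaned names are the ones used in the rest of this script, such as culmen_length_mm and body_mass_g.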
Exploratory Analysis
Let’s begin by exploring some of the columns of our dataset
There is an island column. Let’s see how many different islands there are in the dataset and how many penguins from the study are on each island. We can use the table function to accomplish this
table(penguins_data$island)
##
##    Biscoe     Dream Torgersen
##       168       124        52
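The same counts can also be produced with count from dplyr, which returns a data frame that is easier to pipe into later steps. This is a sketch of an alternative, not what the script above does, and it uses a made-up stand-in table so it runs on its own.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for penguins_data; only the island column matters here.
toy <- tibble(island = c("Biscoe", "Dream", "Biscoe", "Torgersen", "Biscoe"))

# One row per island, with the number of rows in column n, largest first.
island_counts <- toy %>%
  count(island, sort = TRUE)

island_counts
```

With the real data, the equivalent call would be penguins_data %>% count(island).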
Now, let’s look at the body mass column by creating a histogram
ggplot(data = penguins_data, aes(x = body_mass_g)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
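Both messages above can be addressed in the geom_histogram call. The sketch below sets a bin width and drops missing values silently; the data are made up and the bin width of 500 g is an arbitrary choice for illustration, not a recommendation for the real dataset.

```r
library(ggplot2)

# Made-up body masses in grams, including one missing value.
toy <- data.frame(
  body_mass_g = c(3200, 3450, 3750, 3800, 4100, 4650, 5000, 5400, NA)
)

# binwidth is in the units of the x axis (grams here); 500 is an arbitrary choice.
# na.rm = TRUE drops missing values silently instead of warning.
mass_histogram <- ggplot(data = toy, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 500, na.rm = TRUE)

mass_histogram
```

It is worth trying a few bin widths, since a histogram can look quite different depending on how wide the bins are.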
This looks good. Now let’s clean this up and look at body mass by species of penguin
ggplot(data = penguins_data, aes(x = body_mass_g,
                                 fill = species)) +
  geom_histogram(color = "black") +
  theme_minimal() +
  labs(title = "Penguins, Palmer Station LTER",
       subtitle = "Body Mass Distribution for Adelie, Chinstrap and Gentoo Penguins",
       x = "Body mass (g)",
       y = "Number of Penguins",
       fill = "Penguin species")  # species is mapped to fill, so the legend title is set with fill
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
Now let’s look at the relationship between flipper length and body mass with a scatter plot
ggplot(data = penguins_data, aes(x = flipper_length_mm,
                                 y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
It looks like there is a strong positive correlation between body mass and flipper length. Let’s add some more variables from the dataset: island, species, and sex
ggplot(data = penguins_data, aes(x = flipper_length_mm,
                                 y = body_mass_g,
                                 color = species,
                                 shape = sex)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin sex") +
  theme(axis.text.x = element_text(angle = 45)) +
  facet_grid(~island)
## Warning: Removed 10 rows containing missing values (`geom_point()`).
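The warning now reports 10 removed rows instead of 2, presumably because sex is missing for some penguins and points without a shape cannot be drawn. One way to make that removal explicit is to drop incomplete rows before plotting. The sketch below uses drop_na from tidyr on a made-up stand-in table; the values are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up stand-in for penguins_data with gaps in two columns.
toy <- tibble(
  flipper_length_mm = c(181, 186, NA, 210),
  body_mass_g       = c(3750, 3800, 3250, 4500),
  sex               = c("MALE", NA, "FEMALE", "FEMALE")
)

# Keep only rows where every plotted column is present.
toy_complete <- toy %>%
  drop_na(flipper_length_mm, body_mass_g, sex)

# Number of rows that geom_point() would otherwise drop with a warning.
nrow(toy) - nrow(toy_complete)
```

Filtering first means the number of excluded penguins is visible in the code and can be reported alongside the plot.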
Now let’s explore the relationships between our numeric columns in the dataset with a correlation matrix
Select numeric columns for correlations
penguins_data_numeric <- penguins_data %>%
  select(culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g)
Create a correlation matrix
cor_matrix <- cor(penguins_data_numeric, use = "complete.obs")
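The use argument of cor controls how missing values are handled. The sketch below compares the options on a small made-up table: filtering to complete rows first gives the same matrix as use = "complete.obs", while "pairwise.complete.obs" can give different values because each pair of columns keeps every row where both are present. The table toy and its values are illustrative only.

```r
# Made-up numeric table with one missing value.
toy <- data.frame(
  culmen_length_mm  = c(39.1, 39.5, 40.3, 46.5, 50.0),
  flipper_length_mm = c(181, 186, 195, NA, 220),
  body_mass_g       = c(3750, 3800, 3250, 4500, 5700)
)

# Keep complete rows first, then correlate.
cor_filtered <- cor(toy[complete.cases(toy), ])

# Same result: let cor() drop incomplete rows itself.
cor_complete <- cor(toy, use = "complete.obs")

# Different in general: each pair of columns uses every row where both are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# TRUE when the first two approaches agree.
all.equal(cor_filtered, cor_complete)
```

With "complete.obs", every correlation in the matrix is computed from the same set of penguins, which makes the values easier to compare with one another.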
Plot the correlation matrix
corrplot <- ggcorrplot(cor_matrix, type = "lower", outline.color = "white") +
  theme(axis.text.x = element_text(size = 3),
        axis.text.y = element_text(size = 3))
corrplot
Make the correlation plot interactive
ggplotly(corrplot)